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In microarray technology, a number of critical steps are required 
to convert the raw measurements into the data relied upon by biol- 
ogists and clinicians. These data manipulations, referred to as pre- 
processing, influence the quality of the ultimate measurements and 
studies that rely upon them. Standard operating procedure for mi- 
croarray researchers is to use preprocessed data as the starting point 
for the statistical analyses that produce reported results. This has 
prevented many researchers from carefully considering their choice 
of preprocessing methodology. Furthermore, the fact that the pre- 
processing step affects the stochastic properties of the final statisti- 
cal summaries is often ignored. In this paper we propose a statisti- 
cal framework that permits the integration of preprocessing into the 
standard statistical analysis flow of microarray data. This general 
framework is relevant in many microarray platforms and motivates 
targeted analysis methods for specific applications. We demonstrate 
its usefulness by applying the idea in three different applications of 
the technology. 

1. Introduction. Microarray technology measures the quantity of nucleic 
acid molecules present in a biological sample referred to as the target. To 
do this, we take advantage of hybridization properties of nucleic acid and 
use complementary molecules immobilized on a solid surface, referred to as 
probes. The nucleic acid molecules in the target sample are labeled with fluo- 
rescent dyes and hybridized to the probes. A specialized scanner is then used 
to measure the amount of fluorescence at each probe which is reported as 
an intensity. A defining characteristic of microarray technology is that it in- 
cludes thousands of probes on a relatively small surface such as a glass slide. 
Various manufacturers provide a large assortment of different platforms. 
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In this paper we refer to the probe hybridizing to its target phenomenon 
as specific binding. We define this term because, in practice, probes hybridize 
to the noncomplementary molecules as well. We refer to this phenomenon as 
nonspecific binding. Observed intensities are a combination of optical noise, 
a nonspecific binding component and a specific-binding component. Most 
manufacturers try to measure the optical and nonspecific components di- 
rectly by including additional features or obtaining readings from areas of 
the chip with no probe. We refer to the intensities read for each of these fea- 
tures as probe-level data. In practice, various sources of variation need to be 
accounted for and these data are heavily manipulated before one obtains the 
genomic-level measurements that most biologists and clinicians use in their 
research. This procedure is commonly referred to as preprocessing. Most pre- 
processing procedures only provide point estimates for each genomic unit per 
sample. Although some methods provide measures of uncertainty for these 
estimates, they are often discarded in higher level analysis [Rattray et al. 
(2006)]. In this paper we introduce a statistical framework that permits the 
integration of preprocessing into the standard statistical analysis of microar- 
ray data. This general framework can be adapted to a variety of biological 
applications of microarray technology. We present the framework in Section 
3 and demonstrate its use in Section 4 with three examples. 

For the statistical framework presented in this paper it is convenient to 
divide the many different available platforms into two main classes. These 
are differentiated by the type of data they produce. We refer to the platforms 
that produce one set of probe-level data per microarray with some probes de- 
signed to measure specific binding and others to measure nonspecific binding 
as the high density oligonucleotide platforms. Affymetrix™GeneChip® arrays 
are by far the most popular manufacturer of this technology. The two-color 
platforms produce two sets of probe-level data per microarray (the red and 
green channels) and local optical noise levels are measured from parts of 
the glass slide containing no probes. No single company or academic lab 
dominates this market. Notice that there are a handful of platforms that do 
not fall into either of these categories. However, the vast majority of data 
produced in the past 5 years do. 

The most popular microarray application of both platforms is measuring 
genome-wide transcription activity. In this application each gene is repre- 
sented by one or more probes that will hybridize with the RNA transcribed 
from that gene. In practice, researchers using microarrays for this purpose 
start out with the probe-level data. However, most microarray products 
come equipped with softwares that preprocess these data into higher level 
measurements where each gene gets assigned one value on each array. This 
value is presented as the starting point for analyses that eventually lead to 
the results published in the scientific literature. Examples of these higher 
level analyses are identifying differentially expressed genes, class discovery 
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and class prediction. In some cases, the data manipulations performed in the 
preprocessing step turn out to be rather complicated. Three steps typically 
carried out in preprocessing are: 

1. Adjusting probe intensities for optical noise and/or nonspecific binding. 
This task is referred to as background correction. 

2. Adjusting probe intensities to remove systematic biases due to technical 
variations such as different labeling efficiency, scanner setting or physical 
problems with the arrays. This task is referred to as normalization. Note 
that this does not necessarily mean that we transform the data to have 
a normal distribution, as the term traditionally implies in statistics. 

3. When multiple probes represent a gene, summarizing the observed inten- 
sities to attain one number for each gene. We will refer to this step as 
summarization. 

We will refer to these as the three main preprocessing tasks. For both plat- 
form classes, many different approaches have been proposed for each of 
these three steps, resulting in competing preprocessing algorithms. Most 
of these preprocessing algorithms do not try to estimate the measures of un- 
certainty that accompany the resulting gene-level expression estimates. For 
example, nonlinear normalization routines, such as quantile normalization 
[Amaratunga and Cabrera (2001)] and "variance stabilizing normalization" 
[Huber et al. (2002)], can artificially reduce variation of the gene-level mea- 
surements. This fact is rarely taken into account in the higher level analyses. 
Notice that, for researchers with the luxury of numerous array replicates, 
this is not necessarily a problem because measures of uncertainty can be 
estimated from the gene-level data. However, this situation is not common 
in academia and governmental institutions. Thus, for most microarray ex- 
periments, it becomes important to obtain as much information as possible 
about the stochastic properties of the final summary statistics from the 
probe-level data. By posing models for these data, any manipulation could 
be described statistically and bottom line results can be better understood. 

Microarrays are now being used to measure genomic endpoints other than 
gene expression, including yeast mutant representations, the presence of Sin- 
gle Nucleotide Polymorphisms (SNPs), presence of deletions/insertions, and 
protein binding sites by chromatin immunoprecipitation (known as ChlP- 
chip). In each case, the units of measurement continue to be the probes. 
Without appropriate understanding of the bias and variance of these mea- 
surements, biological inferences based upon probe analysis will be compro- 
mised. In Section 3 we present a general statistical framework which consists 
of a stochastic model for probe-level data, useful for any microarray applica- 
tion, and procedures for quantifying answers of scientific interest that permit 
measuring the statistical properties introduced by the three main prepro- 
cessing tasks. In Section 4 we give examples of the usefulness of our proposal 
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in three specific applications of microarray technology: detecting expressed 
genes, estimation of differential expression, and identification of synthetic 
lethality and fitness defects in yeast mutants. Data used in the first two 
examples are from a high-density oligonucleotide platform and data used in 
the third example are from a two-color platform. 

2. Previous work. Various research groups have demonstrated that sta- 
tistical methodology can provide great improvements over the ad-hoc pre- 
processing procedures offered as defaults by the companies producing the 
arrays. The implementation of these methods have resulted in useful pre- 
processing algorithms which have already provided better scientific results 
for users of gene expression arrays. Most of these procedures perform all 
three main preprocessing tasks. However, some approaches follow a step-by- 
step/modular approach, and others follow a global/unified approach. 

For a detailed description of the three major preprocessing tasks and a 
review of some of the most popular preprocessing methodologies, we refer 
the readers to our working paper [Wu and Irizarry (2005)]. Here we de- 
scribe the additive-background-multiplicative-error (addimult) model that 
has been implicitly or explicitly assumed to motivate most of the widely 
used preprocessing procedures. 

2.1. The addimult model. After target RNA samples are prepared, la- 
beled and hybridized with arrays, these are scanned and images are pro- 
duced and processed to obtain an intensity value for each probe. These 
intensities represent the amount of hybridization for each probe. However, 
part of the hybridization is nonspecific and the intensities are affected by 
optical noise. Therefore, the observed intensities need to be adjusted to give 
accurate measurements of specific hybridization. In this paper we refer to 
the part of the observed intensity due to optical noise and nonspecific bind- 
ing as background noise. Wu et al. (2004) describe experiments useful for 
understanding background noise behavior that empirically confirm that its 
effect is additive and its distribution has nonzero mean. 

The component of the observed intensities related to specific binding is 
also affected by probe properties as well as measurement error. By using 
the log-scale transformation before analyzing microarray data, many inves- 
tigators have implicitly assumed a multiplicative measurement error model 
[Dudoit et al. (2002), Kerr, Martin and Churchill (2000), Newton et al. (2001), 
Wolfinger et al. (2001)]. Furthermore, various groups, for example, Li and 
Wong (2001), have demonstrated the existence of strong multiplicative probe 
effects on the ability to measure specific signals. 

Most ad-hoc preprocessing algorithms subtract background and then take 
the log which arguably implies an addimult model. However, Cui et al. 
(2003), Durbin et al. (2002), Huber et al. (2002) and Irizarry et al. (2003a) 
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have explicitly proposed addimult models and motivated algorithms based 
on these. A general form of this model is simply 

(1) Y = B + S, 

with Y the observed intensity, B the background noise component and S 
the specific binding component which includes multiplicative effects. 

2.2. Modular versus unified approaches. The three main preprocessing 
tasks can be performed sequentially and produce gene level measures for 
each gene on each array. Further higher level analyses use the summaries 
from preprocessing as input data. We refer to this type of data flow as the 
modular approach. One disadvantage of the modular approach is that often 
the effect of the preprocessing steps on the stochastic properties of the final 
statistical summaries is ignored. For example, the gene expression measures 
produced by preprocessing procedures often have different uncertainties. Not 
all preprocessing methods provide a measure of this uncertainty. Even when 
an uncertainty measure is provided, most high level analysis methods do not 
make use of it. There are some exceptions, for example, Liu et al. (2006) 
and Rattray (2006) show that propagating the uncertainty in gene expression 
measures can improve accuracy in detecting differential gene expression and 
principle component analysis. They still use a modular approach, but include 
the variance obtained from preprocessing as part of variance of the log scale 
expression level in high level analysis. 

Another disadvantage of the stepwise modular approach is that each step 
is independently optimized without considering the effect of previous or sub- 
sequent steps. This could lead to sub-optimal bottom-line results. Various 
investigators have used the addimult model to combine the background ad- 
justment and normalization step into a unified estimation procedure. For 
example, Durbin et al. (2002), Huber et al. (2002), Geller et al. (2003) and 
Cui et al. (2003) use addimult models to motivate a transformation of the 
data that removes the dependence of the variance on the mean intensity lev- 
els. However, these procedures do not define and estimate parameters that 
represent quantities related to a scientific question as we wish to accomplish 
with our general framework. 

Some methods have been proposed to estimate, or test for, differential ex- 
pression as part of a more general estimation procedure that performs some 
of the main preprocessing tasks. For example, Kerr et al. (2000) propose the 
use of ANOVA models to test for differential expression across different pop- 
ulations in two-color arrays. Their models include parameters to account for 
the need for normalization. However, the background adjustment step is per- 
formed separately. Wolfinger et al. (2001) propose a similar model that per- 
mits some of the effects to be random. This group developed the equivalent 
approach for high-density oligonucleotide arrays [Chu, Weir and Wolfinger 
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(2002)]. In both approaches no background adjustment is performed. Hein et 
al. (2005) propose a Bayesian model for high-density ohgonucleotide arrays 
that combines background adjustment and summarization, and permits the 
possibihty of estimating more meaningful parameters along with credibility 
intervals. However, the normalization task is not addressed and probe effects 
are not considered in the summarization. 

In the next section we propose a statistical framework that will permit us 
to estimate parameters of interest and perform all three main preprocessing 
tasks in one estimation procedure. The measures of uncertainty will therefore 
account for the preprocessing. 

3. A general statistical framework. The first step in our proposed frame- 
work is the definition of target DNA/RNA molecule of interest. For example, 
in expression arrays, we are interested in RNA transcripts. Then, for each 
target molecule, a set of probes, that will provide specific binding measure- 
ments for this target, are identified. Probes that provide information about 
nonspecific binding are also identified. Finally, answers to scientific ques- 
tions related to these target molecules can be quantified as summaries of 
the parameters in the following statistical model: 



with g = 1, . . . ,G,i = 1, . . . ,1 ,j = 1, . . . , Jg and h = l,. . . ,H. 

Here Y^^j is the probe intensity read from a probe of type h, for target 
molecule g, in array i, and probe j. For example, in GeneChip arrays h = l,2 
will correspond to PM or MM and in two-color arrays to Red or Green. The 
target molecules of interest, such as mRNA transcripts or SNP sites, are 
indexed with g. The different probes used to represent a target are denoted 
with j. In many cases, for example, most two-color platforms, only one probe 
is used and j can be omitted. 

The probe intensity contains three major components: optical noise O, 
intensities due to nonspecific and specific binding, N and S. In our model 
O can depend on the various indexes, but in all our examples we consider 
it to be a constant for each array. The N and S components can be further 
decomposed into 



The mean level of nonspecific intensity for the jth probe of type h related 
to target molecule g is represented by figj , and random effects that explain 
differences from array to array are denoted with (^g^j. The fact that the N 
is strictly positive explains the need for background adjustment. If target 
molecule g is present, then the specific binding component Sg is formed by 





(3) 



Nj^ij = expifi^ij + and 
Sgij = exp(i/f + e'ii + <P^„i^ + •), 
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an array specific constant v that explains the need for normaUzation, a log- 
scale probe effect 0, measurement error e and a quantity proportional to 
the amount of transcript exp{0}. For example, in two-color arrays, Q'^'^^ and 
^green j-gp^gggnt the Specific binding in the two channels. In GeneChip arrays, 
qPM j-gpj-gggnt the specific binding on the PM probe, and represent the 
intensity due to binding of the PM target on the MM probe. It has been 
observed that, at least for some probes, Q^^ > [Wu and Irizarry (2004)]. 

The distribution of stochastic components in (3) will depend on the plat- 
form and application. However, we model ^ with a normal distribution in 
the examples presented in this paper. Using an experiment designed specifi- 
cally to motivate a stochastic model for background noise, Wu et al. (2004) 
demonstrate this is a reasonable assumption for GeneChip arrays. Below we 
present evidence that the log-normal assumption applies to two-color plat- 
forms as well. If we remove outliers, the normal assumption appears to be 
useful to model e as well. 

Notice that some of the models motivating the unified preprocessing al- 
gorithms described in Section 2 are special cases of model (3). An example 
is the model proposed by Durbin et al. (2002) for two-color platforms. To 
obtain their model from ours, we need to assume is and that O is 
normally distributed. Instead of estimating 9, Durbin et al. (2002) derive 
a transformation t for which the variance of A = t{Y^) — t{Y^) does not 
depend on the expectation of and . The difference A is used as a mea- 
sure of relative expression on the two samples. Huber et al. (2002) follow 
a similar approach. Unlike Durbin et al. (2002), they explicitly include (p 
and v in their model. As in Durbin et al. (2002), they consider A^ -|- O to 
be normally distributed. Because their procedure was originally developed 
for two-color arrays, 4> is absorbed into 9. Using an ad-hoc robust version 
of maximum likelihood estimation, the parameters are estimated to derive 
a transformation similar to the one proposed by Durbin et al. (2002). The 
model described by Kerr et al. (2000) is also a special case of ours. They 
assume Y has been background adjusted and, therefore, that O and A^ are 
0. They incorporate the estimation of differential expression with the nor- 
malization step by permitting 9^^ to be constant for measurements from the 
same population. Hein et al. (2005) use our model as well, but impose further 
assumptions on the distribution of the parameters. The u and (p parameters 
are not accounted for though. 

The approaches described by Durbin et al. (2002) and Huber et al. (2002) 
assume O -|- A^ to be IID normal for each hybridization. As mentioned, em- 
pirical evidence suggests that this assumption is incorrect and that the dis- 
tribution of the background component is heavily right skewed. Therefore, 
a log-normality assumption is more appropriate. This incorrect assumption 
of normality has a relatively large impact on the accuracy of the expression 
level estimate. Figure 1 compares the resulting expression estimates ob- 
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Fig. 1. Log {base 2) expression estimates plotted against nominal log {base 2) concen- 
tration in picoMolar, computed with background adjustment described in the text. To make 
the curves comparable, the lines are shifted so that they have the same expression at log 
concentration 8 picoMolar (3 in log base 2). 



tained from using VSN procedures proposed by Huber et al. (2002) to the 
GCRMA procedure which uses a log-normal assumption [Wu et al. (2004)], 
using data from an assessment experiment (described in more detail in Sec- 
tion 4.1.3). The result from the generalized-log (glog) proposed by Durbin 
et al. (2002) is almost indistinguishable from VSN and is omitted. We also 
compare these to a procedure that is just like GCRMA, but does no back- 
ground correction at all. The figure shows averaged log (base 2) expression 
estimates plotted against known log (base 2) concentration levels for data 
from an assessment experiment (described in more detail in Section 4.1.3). 
Appropriate background adjustment will yield a straight line and, according 
to our model, no background adjustment will yield flat local slopes for low 
concentrations. Notice that the procedures using the normality assumption 
are almost equivalent to not correcting for background. 

Although the proposal of using a log-normal distribution for the back- 
ground noise provides great practical improvements, the major advantage 
of our statistical framework is that it will permit us to describe final results 
of scientific interest with rigorous statistical statements. We will be able to 
quantify scientific questions as an exercise of optimizing estimation of a set 
of model parameters. With the proper model in place, fitting the model will 
produce direct estimates of the parameters of interest along with uncertainty 
measures that take into account the effects of the three main preprocessing 
tasks. 



FRAMEWORK FOR MICROARRAY PROBE-LEVEL DATA 



9 



Fitting model (3) in practice will sometimes be challenging. Many param- 
eters in model (3) are not identifiable without certain constraints. However, 
the platform designs usually impose constraints that allow the parameters 
to be identified. For example, in GeneChip arrays we will assume that the 
probe-effects does not depend on array i and that does not de- 
pend on the probe-type h. In two-color platforms we will assume that (pg^j 
does not depend on probe- type h. In platforms that use isothermal design 
[Wang et al. (2006b)] to minimize the range of optimal hybridization tem- 
perature of all probes, it is possible to omit <pgij- Other application specific 
assumptions that make the model more parsimonious will be demonstrated 
by examples in Section 4. 

The choice of which components in the model are random and which are 
fixed will also vary from application to application. In some applications we 
may model (pgi with a normal distribution that does not depend on i or (7 as 
done by Wolfinger et al. (2001). In cases where we assume the variance of 
£ depends on g, then assuming this variance follows, for example, a gamma 
distribution across g will add power to the analysis. For gene-level data, 
these types of hierarchical models have greatly improved results in practice, 
such as the ability of finding differential expressed genes. For example, see 
Lonnstedt and Speed (2002), Smyth (2004), Gottardo et al. (2003), Pan et 
al. (2003) and Kendziorski et al. (2003). However, in spite of the improve- 
ments and the easy interpretation of the results (e.g., posterior probability 
of differential expression), we often find Bayesian inference computationally 
expensive, as observed by other researchers [Liu et al. (2006)]. In the cases 
when the cost of computing becomes impractical, ad-hoc versions are pos- 
sible. In these cases we can still use the model assumptions to describe the 
statistical characteristics of the resulting data summaries. 

4. Applications. In this section we describe how our framework can be 
adapted to give solutions to three important practical problems: detecting 
expressed genes, estimation of differential expression, and identification of 
synthetic lethality and fitness defects in yeast mutants. In each section we 
briefly describe the scientific problem, the way our framework will be im- 
plemented, a dataset used to assess the performance of our approach, and 
results comparing our approach to standard ones. 

4.1. Detecting expressed genes. 

4.1.1. Scientific problem. For any given target sample, it is not likely 
that transcripts from all genes are present. Determining which transcripts 
are present is sometimes of scientific interest. The Affymetrix default soft- 
ware (MAS 5.0) includes an algorithm for the detection of expressed genes 
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using GeneChip arrays. The results are summarized as detection calls that 
can take the values absent (A), marginal (M) and present (P). Using our 
framework, one can construct an algorithm by viewing the problem as test- 
ing the hypothesis 

E[5™]=0 for alii = 1,...,J3, 

for each gene g on each array i, with the alternative hypothesis E[S'^j^] > 
(remember Sgjj^ > by definition). For this application we also assume that 

O is constant within array. Because the variability of the /x™ across {g,j) 
has been demonstrated to be very large [Wu et al. (2004)], this problem is 
not trivial. The solution offered by Affymetrix can be derived by assuming 
^™ = /x^f and = for ah probes [Hubbeh, Liu and Mei (2002) and 
Liu et al. (2002)]. Under these assumptions, ^[1"™ - Yg^^] = under the 
null hypothesis. A Wilcoxon test on Rg = (y™ - Y^^)/{YgP^ + Y^^) is 
performed on the Jg observations to obtain a p-value^. The default behavior 
of MAS 5.0 is to assign a P, M or A call to a p-value smaller than 0.4, 
between 0.4 and 0.6, and bigger than 0.6 respectively. Liu et al. (2002) 
demonstrate that the algorithm works relatively well in practice. However, 
in this section we demonstrate that our framework can be used to generate 
similar detection calls with almost half of the number of probes. 

4.1.2. Our solution. Empirical results do not support the assumptions 
= and S*^^ = 0. There is strong evidence that S*^^ > for many 
probes [Irizarry et al. (2003a)] and that 7^ /"gfi^ [Naef and Magnasco 

(2003) and Wu et al. (2004)]. Notice that if we can not use the MM probes, 
then we need to have probe-specific information about jJiglf ■ Wu et al. (2004) 

describe methodology for estimating /U^j^ using probe sequence information. 
These authors present two strategies: one uses the MM intensities and the 
other one does not. In both instances, fJ^^j is estimated by fitting a 15- 
parameter model to hundreds of thousands of probes, thus, it is estimated 
with enough precision to consider it known in this application. If we treat 
/i^jj as constant, then testing the null hypothesis without using the MM 
probes is straight forward. Notice that GeneChip arrays include one MM 
for each PM, thus, PM-only arrays can represent twice as many genes at the 
same price or represent the same genes at half-price. We therefore refer to our 
approach without using the MMs as the half-price procedure. Notice that 
the commercial arrays created by NimbleGen [Singh-Gasson et al. (1999)] 
and the new exon arrays from Affymetrix do not include MM probes. The 
half-price procedure will permit users of these arrays to perform detection 
calls. 



^MAS 5.0 software tests the null hypothesis that median(i?g) = r versus the alternative 
hypothesis that median(_Rg) > r for a positive constant r. The default is r = 0.015. 
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4.1.3. Assessment data. To compare the two approaches, we used Affyme- 
trix's spike-in experiment on the HG-U133 platform. This experiment is sim- 
ilar to the one described in Irizarry et al. (2003b) and Cope et al. (2004). In 
this experiment transcripts from 42 genes were artificially added or spiked- 
in to a complex cRNA target at 14 different concentrations ranging from 
to 512 picoMolar. Fourteen different mixtures were formed by varying 
the concentrations following a Latin-square design. Three replicates of these 
mixtures were formed and hybridized to 42 GeneChip arrays of the same 
type. The 42 spiked-in genes were known not to be present in the original 
cRNA target, thus, if their spike- in concentration was 0, then the correct 
detection call is A. For all other concentrations the correct call is obviously 
P. 

4.1.4. Results. Figure 2 compares the results obtained using Affymetrix's 
default procedure and our proposed approach. Figure 2A shows ROC curves 
for detecting target presence for genes, at various concentrations ranging 
from 0.125 to 2 picoMolar, using our approach (solid lines) using MM data or 
MAS 5.0 (dashed lines). Using both the PM and MM measurements as MAS 
5.0, our approach tops the result from MAS 5.0. Figure 2B compares the 
result from the half-price approach (solid line) with MAS 5.0 (dotted lines). 
Using nearly half the information, the half-price approach achieves similar 
sensitivity and specificity as MAS 5.0, suggesting this is a useful alternative 
when no MM measurements are available. In fact, the Affymetrix Exon 
arrays have adopted a design without MM probes. Another advantage of our 
approach is that it can easily be extended to include replicates and compute 
one p-value for each gene under each condition. The Affymetrix algorithm 
assigns Present / Absent calls to each gene on each array. Common practice to 
dealing with replicates is to call a gene "present" when the number of present 
calls exceed an arbitrary proportion of the replicates within a condition. 

4.2. Estimating differential expression. 

4.2.1. Scientific problem. In this application we typically have two classes 
of samples (e.g., experimental and control) and in many cases we have var- 
ious replicates. We are interested in measuring differential expression for 
each gene. Currently, the standard approach is to first preprocess the probe- 
level data and then use statistical procedures developed for gene-level data 
[Chu, Weir and Wolfinger (2002), Dudoit et al. (2002), Kerr et al. (2002), 
Kerr, Martin and Churchill (2000), Lee et al. (2000), Lonnstedt and Speed 
(2002), Newton et al. (2001), Schena et al. (1996), Tusher, Tibshirani and Chu 
(2001), Wolfinger et al. (2001), Yang et al. (2002)] without consideration of 
the preprocessing algorithm. In this example we will use data from GeneChip 
arrays. 
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Fig. 2. (A) Receiver operating curves for detection calls from MAS 5.0 {dashed lines) 
and our approach {solid lines), when both PM and MM measurements are used. The 
concentrations of the present genes are 0.125, 0.25, 0.5, 1 and 2 picoMolar. (B) Same as A 
except the half-price approach {solid lines) is compared to MAS 5.0 {dashed lines). 

4.2.2. Our solution. In this context, we quantify differential expression 
by defining On = /3o,g + Pi^gXi, with = 1 if array i was hybridized to the 
experimental target, and Xj = otherwise. The parameter of interest will be 
Pi,g- For this application we use the MM probes only to estimate f^gij- To do 
this, we assume that S^j^ = which, as seen below, does yield useful results. 
To reduce the number of parameters needed to represent the probe-specific 
mean levels, we use probe sequence information as described by Wu et al. 
(2004). In summary, the fi^^j and 4>g^j are assumed to be linear functions of 
indicator variables denoting what base (G, C, T or A) is in each position 
of the probe. We assume the base effect is a smooth function of position 
and use splines with 5 degrees of freedom to model these functions. These 
assumptions reduce the number of parameters from hundreds of thousands to 
less than 20. See Wu et al. (2004) for more details. Other minor assumptions 
about across and within array correlations are described in the Appendix. 

With the specifics of the model in place, we are ready to estimate (3i^g. 
A possibility is to obtain the MLE along with a standard error for this es- 
timate. Alternatively, we can pose a Bayesian model and obtain posterior 
distributions of the Pi^g. Wu et al. (2004) demonstrate that normal distri- 
bution is a reasonable assumption for in model (3). Rocke et al. (2001) 
show that normal distribution is a good approximation for £gij as well. The 
total intensity of nonspecific and specific binding, N + S, is therefore a con- 
volution of two log-normally distributed random variables. This convolution 
has a complex likelihood and makes it computationally impractical to ob- 
tain MLE or Bayesian estimates for each of the thousands of genes on an 
array. On the other hand, the first two moments of Ngij and Sgij are easy 
to compute given the parameters. Therefore, we use generalized estimat- 
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ing equations which rely on simply the first two moments. Details of the 
implementation are in the Appendix. 

Notice that, unlike the modular approach taken by Liu et al. (2006) to 
propagate the measurement error of expression level summaries into differen- 
tial expression detection, in this framework, expression level measurements 
for each array are never calculated. Instead, the parameter of interest is 
calculated along with a measure of uncertainty that includes the effects of 
background adjustment, normalization and summarization. We refer to the 
procedure that leads to an estimate (3i^g and its standard error as the unified 
approach. 

4.2.3. Assessment data. To demonstrate the utility of the unified proce- 
dure, we use data from the spike-in experiment described in Section 4.1.3. 
Recall that the RNA samples in this experiment were the same in all hy- 
bridizations except for the spiked- in genes. The spiked-in genes varied in 
concentration, within and across arrays. This implies that we can find com- 
parisons of arrays for which only 42 genes are expected to be differentially 
expressed. Furthermore, for various comparisons, we had three technical trip- 
licates in each group. We choose comparisons of two triplicates for which the 
expected fold-changes, for most of the spiked-in genes, was 2. 

4.2.4. Results. Figure 3A shows (3i^g plotted against the average log ex- 
pression level (taken across the six arrays) for each gene. Notice that this 
provides similar information to an MA-plot. The blue bars denote point- 
wise critical values for rejecting the hypothesis that Pi^g = at the 0.01 
level. These critical values are computed using the fact that our estimates 
are asymptotically normal. Nonspiked in genes, which are known not to be 
differentially expressed, exceeding these bounds are shown with red stars. 
Spiked-in genes are shown with large purple dots. We add detection call 
information (described in Section 4.1) for all other points. Yellow represents 
low value (present gene), red represents large p- value (absent gene). In 
this case the null hypothesis was that the genes were absent in all six hy- 
bridizations. A common approach used by biologists is to filter genes with 
Affymetrix produced absent calls and then compute fold change estimates. 
Figure 2 demonstrates that this will result in many false negatives. We pro- 
pose looking at both fold change estimates and p-values in one plot such as 
Figure 3A. Because we are adding P/A call information to an MA-plot, we 
refer to this as an MA- PA plot. 

Figure 3A demonstrates that a procedure calling genes differentially ex- 
pressed when they are outside the critical value bounds performs rather well. 
Figures 3B compares our results to those obtained with the commonly used 
approaches, the t-test and the linear models for microarray data (LIMMA) 
[Smyth (2004)]. LIMMA is one of the most popular procedures for detecting 
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A tJlA-PA plot B ROC plot 
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Fig. 3. (A) /3i.g plotted against + /3o,g)/2. The color of each dot represents the 
p-value. Yellow represents low p-value (present genes), red presents large p-value (absent 
genes). The blue bars mark <I)(0.995)SE and (j)(0.005)SE , where (j)(-) is the cumulative 
density function of Normal(0,l) . Spiked-in genes are labeled as big purple points. (B) 
Averaged receiver operating curves from 14 comparisons of 2-condition with 3-replicates 
each from the GeneChip spike-in experiment. 

differentially expressed genes among biologists and was designed specifically 
for this application. This method requires expression-level data, thus, we 
demonstrate results obtained using two popular preprocessing algorithms: 
RMA and MAS 5.0 (Affymetrix's default). Figure 3B shows average ROC 
curves for the five procedures obtained from 14 three versus three compar- 
isons. To imitate real data, we excluded comparisons with unrealistically 
large nominal fold changes and with high nominal concentrations. Specifi- 
cally, only comparisons with nominal fold changes of 2 and nominal concen- 
trations smaller than 4 picoMolar were included. The ROC curve demon- 
strates that our procedure performs better than the modular approaches. 
Figure 4 shows log fold change estimates obtained with the three proce- 
dures and demonstrates that our unified approach provides estimates with 
less bias. 

Figure 4 demonstrates that for low nominal concentrations the unified 
approach estimates have larger variances. However, model based standard 
error estimates account for this fact. Figure 5 plots our model-based standard 
error estimate against observed average log intensity for each gene. We also 
plot sample standard deviations, calculated from 42 arrays, of Pi^g for various 
strata of the average log intensity. The model-based standard errors, based 
on three replicates, are very close to the sample standard errors. Notice 
the strong dependence of both standard error estimates on the average log 
intensity. This dependence is predicted by our model. Equation (4) in the 
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Fig. 4. (A) Estimated differential expression for genes with nominal fold change of 2 
obtained with MAS 5.0. The x-axis shows the lower of the two nominal concentrations 
involved m the comparison. (B) As (A) hut using RMA. (C) As (A) hut with our estimate 



Appendix shows that the standard error is proportional to the inverse of 
£[5*]. This provides a plausible alternate explanation to the common claim 
made by many biologists that the high variation observed for low abundance 
genes is a biological reality. Our calculations show that the high variation is 
due to the statistical manipulations needed to correct for background. 

4.3. Identification of synthetic lethality and fitness defects in yeast mu- 
tants. 




1 2 3 4 5 
Average log {base 2) expression 



Fig. 5. Model hased {hlue hars) and empirical standard deviations [red crosses) of jSi^g 
as described in the text. 
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4.3.1. Scientific problem. The Yeast Deletion strain collection was cre- 
ated by an international consortium of yeast geneticists [Giaeve et al. (2002)] 
and is an invaluable resource for genetics research. For each of the 6000+ 
genes in the yeast genome, a mutant yeast strain was created missing that 
gene. Some genes are essential and, thus, the mutants are not viable. Two 
unique DNA tags were incorporated into the genome of each mutant strain. 
Recently, two-channel microarray technology has been developed containing 
the necessary probes to detect the tags [Yang et al. (2005)]. Thus, Microar- 
ray hybridization can be used to measure the representation of each mutant 
in a complex mixture of many different mutants. 

A new collection of mutant yeast that are missing two genes is being 
created. Of interest is to find pairs of nonessential genes for which removing 
both causes lethality or fitness to grow defects. In a typical hybridization, 
various tags will be missing in the experimental target, these represent dead 
yeast, and present in the control target, these represent live yeast. Mutants 
with fitness defects will be under-represented in the experimental target. 
The task is to identify these tags using the microarray data. 

4.3.2. Our solution. Because of financial constrains (for each of the 4000+ 
nonessential genes we need a hybridization), we will typically have only one 
array 1 = 1 per query gene. As mentioned, two tags are used to represent 
each gene; thus, we have two probes per mutant, that is, J = 2 for all g. 
Because the yeast mutants are either dead or alive, we will model 9g with 
a two component mixture distribution. One component will represent the 
dead mutant, that is, S^j = for h = R,G, the other will represent the live 
mutants. Figure 6 A plots a density estimate of log intensities for both R and 
G channels and clearly shows both alive and dead components. This figure 
motivates the assumption that 6g follows a normal distribution for the alive 
mutants. Furthermore, the figures also supports our claim that the normal 
assumption for is useful. 

Once the model is fitted under these assumptions, we are ready to pro- 
vide useful summaries. To quantify the evidence for a gene being dead in the 
experimental target and alive in the control, we compute a likelihood ratio 
comparing a model where and Y'^ come from different mixture compo- 
nents to a model where they come from the same. For mutants that appear 
to be alive in both cultures we can estimate the difference in representation 
log(0,^)-log(ef). 

4.3.3. Assessment data. One mixture of yeast DNA was split into two 
halves, and into each half DNA from a few selected mutants were spiked in 
with known concentration ratio. The concentrations were chosen so that (1) 
some mutants were not represented in the experimental pool and represented 
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Fig. 6. (A) Density estimates of log intensities from the Green {dashed green line) and 
Red channels (solid red line). (B) Log intensity ratios plotted against the log likelihood 
ratios described in the text. 

in the control and (2) some mutants had known fold changes in represen- 
tation when comparing both samples. The spike-in material was introduced 
into the hybridization mixture in three different concentration groups (high, 
medium and low). See Peyser et al. (2005) for more details. 

4.3.4. Results. In Figure 6B we show the log likelihood ratios of mutants 
that had the same representation (imitating alive/alive or dead/dead) or 
were spiked-in only in one sample (imitating dead/alive) plotted against 
the naive log-ratio statistic. This figure shows that the log likelihood ratio 
statistic clearly discriminates the dead/alive mutants from the rest. Various 
number of these genes would not have been detected had we used the log 
ratio. 

In Figure 7 we show box-plots of the MLE of log(^^) — log{9^) for the 
genes that were spiked in to be differentially represented stratified by concen- 
tration groups. In this figure we also show estimates obtained using two stan- 
dard preprocessing procedures. The first is what we refer to as the default 
procedure which background corrects using the direct estimates of back- 
ground noise and normalizes by the log ratio medians. The second is the 
approach proposed by [Dudoit et al. (2002)]. The figure demonstrates our 
estimation procedure offers an improvement in accuracy and precision over 
the other two. As in the previous example, the uncertainty introduced by the 
background adjustment and normalization can be included with our result. 

5. Discussion. We have presented a general statistical framework for the 
analysis of microarray data. The advantages of our framework, as any model- 
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Fig. 7. (A) Box-plot of log fold change estimates using the default preprocessing algorithm 
for the low, medium and high concentration groups. The fourth box-plot shows the log fold 
change estimates of genes that were not spiked-in. (B) ^4^ (A) but with a popular alternative 
preprocessing algorithm. (C) As (A) but with our model-based estimate. 

based inference, depends on the validity of our assumptions. However, we 
believe it to be a general enough framework for it to be relevant in many 
microarray applications, and targeted enough to be useful in practice. We 
have demonstrated the flexibility of our proposal with three examples from 
three very different applications and two different platforms. These examples 
are not intended to be final solutions to the specific problems we presented 
but rather examples of the adaptability of the proposed framework. An 
immense amount of work has been published in the statistics literature for 
both preprocessing and higher-level analyses of microarray data. Our hope 
is that our work will serve as a basic infrastructure that will permit the 
integration of these two bodies of work. 

The first step in building a unified model is to describe deterministic ef- 
fects and all sources of random variation. This allows a complete description 
of the data structure. However, keeping all these error terms in the model 
often increases the computational cost. In many practical situations, empiri- 
cal evidence suggests feasible simplifications. For example, we have observed 
that the variance due to optical noise is mostly negligible compared to the 
variance due to nonspecific binding background. Therefore, in the examples 
we have presented in this paper, we have assumed that O is a constant. 
In general, we have assumed that the random effects and e are corre- 
lated among repeated measures. The strength of these correlations depends 
on the relative magnitude of between and within probe variance. In cases 
where biological variation is large, the within probe variances across arrays 
are higher and these correlations will be smaller than what we encounter 
in technical replicates, as in the Latin-square dataset. In our example for 
estimating differential expression, we do not differentiate the sources of cross 
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array variance and let the data determine plug-in estimates of the correlation 
coefficients. Rattray et al. (2006) take a different approach by propagating 
the uncertainty of gene expression measures obtained in preprocessing pro- 
cedures into a higher-level model, adding biological variance for each gene. 
When biological variance is dominant, this approach will probably generate 
results similar to that from a unified model. However, when the uncertainty 
from preprocessing is large, the resulting error terms can be highly correlated 
because the same probes are used on each array. Existing error propagat- 
ing approaches ignore this fact and will result in overestimates of the gene 
specific variance across samples. Last, we want to point out that both ap- 
proaches to integrating the errors from preprocessing and from biological 
replicates are expected to make a difference mostly in situations with lim- 
ited replicates. When there is sufficient replication, the benefit of integrated 
approaches is limited and may not be worth the complexity of computation. 

We view our proposal as a first step toward a general framework for mi- 
croarray probe-level data analysis. The results obtained in three independent 
applications perform well when compared to procedures that have been fine 
tuned for that specific applications. Other researchers have also proposed 
models based on this framework on other applications of microarrays. Wang 
et al. (2006a) propose a model for estimating genome-wide copy number 
that is essentially built upon this framework. Meyer et al. (2006) adapt the 
model from Wu et al. (2004) for ChlP-chip experiment to identify transcrip- 
tion factor binding sites. Results in these applications are encouraging, but 
further improvements are possible. Below we describe specific aspects that 
we plan to improve and develop in the future. 

Notice that we have purposely left image processing out of our framework. 
Our experience has been that current default image processing software is 
reliable and that alternative algorithms offer small or no advantages. In 
our opinion the added complexity required to model pixel level data is not 
merited. However, as the feature sizes on the array become smaller, and less 
pixels are available per feature, it is possible that expanding the framework 
to describe pixel level data is worth the effort. 

Normalization is one of the most controversial topics among users of the 
technology. Our approach of representing the need for normalization with 
one parameter is admittedly a simplification. Our current model is likely 
to improve by considering more elaborate approaches. Furthermore, one of 
the current limitations of microarray technology is that normalization tech- 
niques are based on assumptions related to the number of genes that are 
changing or the distributions of gene expression across samples. For the ma- 
jority of experiments these are useful assumptions, making these methods an 
invaluable resource for improving the quality of microarray data. However, 
the number of experiments in which these assumptions are not adequate is 
growing. New normalization schemes will need to be developed. Although 
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offering a specific solution to this problem is outside the scope of this paper, 
our framework is general enough to adjust to these type of data. 

Notice that in model (3), if we remove background, intensity grows lin- 
early with amount of target. Various groups, for example, Hekstra et al. 
(2003) have noticed that probe hybridization can be modeled using the 
Langmuir or Hill equations. These models predict a nonlinear behavior. In 
particular, they predict the observed phenomenon that changes are atten- 
uated for high-abundance genes. Furthermore, various groups, for example, 
Cludin et al. (2001), have noticed that scanner saturation has a similar ef- 
fect. In our experience this effect affects a very small proportion of genes 
[Irizarry, Wu and Jaffee (2006)]. However, adapting our model to account 
for these effects is possible. 

Finally, high-level analyses certainly include more than the examples 
we give in this paper. For example, gene networks describing the inter- 
action and regulation among genes are often more meaningful than infer- 
ence on single genes. Studying gene networks using microarrays often relies 
on the ability to estimate the co-regulation of gene expression. The Pear- 
son correlation of gene expression is often used as a measure of similarity 
[Getz, Levine and Domany (2000)]. However, in cases with only a few bio- 
logical samples, gene-level expression summaries do not yield reliable esti- 
mates of correlation coefficients. Using the probe level model, we can take 
advantage of the fact that each probe within a probeset respond to the same 
biological variation. Estimating correlation of gene expression from the two 
sets of probe level data can increase the efficiency over the gene-level ex- 
pression measures. 

APPENDIX 

A.l. Generalized estimating equations for GeneChip spike-in experiment. 

To define the model, we let 'Ygj = {Ygij ■,Yg2j , ■ ■ ■ , ^gij, ■ • ■ > ^gij)' denote the 
vector of PM intensities for probe j across the samples i = 1, . . . ,/. Simi- 
larly, Ngj, Sgj, $,gi, Egj denote the vectors for probe j across samples cor- 
responding to the definition in model (3), u = {i>i,i'2, ■ ■ ■ ^Vi, ■ ■ ■ ji'j)' and 
^gj ~ ^4>gj-,4>gj-, ■ ■ ■)'■ We ignore the variance in optical noise and, as explained 
in the next section, adjust for it by subtracting the minimal intensity on each 
array. We write the optical-noise-adjusted intensities as 

^gj ~ ^gj + ^gj 

= exp{(j.gj + ^gj} + exp{v + + X^e + Sgj}. 

Here = {01,02)' is the vector of the log scale expression in the two 
conditions, X is the design matrix. 

We compute plug- in estimators for fi, (pgj and f using data from the 
entire array as described in the next section. We add constraints that the 
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mean of i^s and (p of average affinity are such that 6 is identified. We use 
probe sequence information to predict /igj. However, the probe effects are 
not completely accounted for by that linear function base effect. Therefore, 
considering the same probes are used across arrays, we allow the random 
effects to be correlated: var{^gj) = S^, where S-^ = cr^ and S^, = pn(^% for 
i 7^ i' . var{egj) = S"^, where = (t| and S^, = pscr^ for i ^ i' . The and S 
notation denote the nonspecific and specific components. We assume ^ and 
e follow normal distribution and the mean of variance of Ygj is determined 
accordingly. 

We then estimate 6 for each gene g using the following generalized esti- 
mating equation: 

jEA,,(0)(y,,-E,[Y,,]) = O, 

where Agj{6) = (— ^^^)'V(^^ and Vq is a diagonal working covariance ma- 
trix. 

The asymptotic variance of 6 is 
where 

,dEg,[Yg,] 



D = E|Ag,(0o)- 



de 

n = E{Agj{Oo) YaTeo{Ygj)Ag,{eoy}. 
We estimate D and Q, with 



D- Va f«^^Mk] 



n = jAg,ie)var^{Yg,)Ag,iey. 

Notice the averaging is taken over the probes and the sample size here 
is the number of probes. Although the number of probes in a probe set is 
not very large (11-20 and typically 16 in GeneChip arrays), Figure 5 shows 
that the estimated variance based on this asymptotic result fits the observed 
variance quite well. We believe this is explained by the fact that, although 
marginal distribution of intensities is right skewed, condition on probe, the 
distribution of intensities across arrays is approximately normal. This results 
in a fast convergence of the sandwich estimator of the variance. 

An interesting application of these results is that we can illustrate the re- 
lationship between the asymptotic variance and the magnitude of 9. We con- 
sider the simplest null case where 6 = {9,9, ... , 9)' , vi = and all probes in 
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this probeset have the same probe effect. We consider a simple /c-control/A;- 
treatment comparison, therefore, X-'" is the matrix 



1 1 







1 1 



To simplify notation, we use 71 = E[Ngij\, 72 = E[Sgij\, V = waiiYgij), 
W = cov {Ygij,Ygi'j). The normal assumption about e implies 72 = e'^^+^+^'s/^ 
and ^ = 72. Therefore, 



A- 



D 



n 



72 



V ' 

± 

y2 



12x2, 



k^W 
k^ ^(k-\)W\ 



The asymptotic variance of 6 is then 

'k[V + {k-l)W] k'^W 

k^W k[V + {k-l)W] 

and the variance of 9i — 62 oc {V — W)/^2- Using the normal assumption 
again, we have V = 7^ (e'^jv — 1) + 7|(e'^s' — 1) and W = ^f{e^N"N — 1) + 
^2(gPs4 _ 1). This implies 



(4) 



var(0i — 62) oc 



7l 



which predicts that the variance of estimated differential expression con- 
verges to a constant as expression levels increase (72 increases) and is ap- 
proximately proportional to when S is small. 



A. 2. Ad-hoc plug-in estimates. For our example, we assume Ogij is con- 
stant and form an estimate O, using the minimum observed intensity on 
each array. We do this because the variance of Ogij is negligible compared 
to the variance of Ngij [Wu et al. (2004)]. To estimate figij, probe affinities 
Ugj are computed using probe sequence as described in Wu and Irizarry 
(2004). We assume that figij is a smooth function h of these affinities, that 
is, figj = h{agj), and estimate figij through estimating h. Specifically, a loess 
curve is fit to the log(y'^'''^ — O) versus scatter plot to obtain h. The 

IJ,gij are then estimated as hi{agj^). The residuals from the loess fit are 
used to estimate the variance of ^, aj^. To estimate the correlation coeffi- 
cient Pat, we identify a subset of probes with log(y™ — O) less than the 
corresponding /i. The target niRNAs of these probes are likely to be absent 
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and log{Y^^ — O) ^ N. We obtain sample variance of each probe across 
arrays and use a^g to denote the mean of these variances. pj\f is calculated 
as i^N-d-%Q)/a%. 

To estimate S'^, we first identify a subset of probe sets with high expres- 
sion level such that log(PAf) ~log(S'). Within each probe set we estimate 
the sample variance of log(PM) and use the mean as the estimate of cr|. 
Using these probes, we regress log(PM) on a^^ to predict (p for all probes. 
To estimate ps, we use a similar approach as for p^: from a subset of probes 
with strong signals, we obtain sample variance of each probe across arrays 
and set (t|q as the mean of those variances, ps is calculated as (o"| — o"|q) / (t| . 
We estimate 6 first under 1^ = 0. The normalization parameters are then 
estimated such that the 6i — 02 has weighted mean 0, with the weights from 
estimated standard errors of 6i — 62- 

For genes that are not expressed in at least one condition, 9 = —00 and 
the GEE may not converge in obtaining 9. For these genes we compute a 
p-value testing the hypothesis that it is absent under both conditions. We 
compute p- value for differential expression for all other genes based on the 
asymptotic normal distribution. 

The code implementing the GEE for GeneChip arrays is made into an R 
package uniarray and is available at http : / / www . stat . brown . edu/~zwu/ software . html. 

Acknowledgments. The authors thank Magnus Rattray and Xuejun Liu 
for their help with the pplr model. 

REFERENCES 

Amaratunga, D. and Cabrera, J. (2001). Analysis ol data Irom viral DNA microchips. 
J. Amer. Statist. Assoc. 96 1161-1170. MR1963418 

Chu, T.-M., Weir, B. and Wolfinger, R. (2002). A systematic statistical linear model- 
ing approach to ohgonucleotide array experiments. Math. Biosci. 176 35-51. MR1869191 

Chudin, E., Walker, R., Kosaka, A., Wu, S. X., Rabert, D., Chang, T. K. 
and Kreder, D. E. (2001). Assessment of the relationship between signal intensities 
and transcript concentration for Affymetrix GeneChip arrays. Genome Biol. 3 RE- 
SEARCH0005. 

Cope, L., Irizarry, R., Jaffee, H., Wu, Z. and Speed, T. (2004). A benchmark for 

Affymetrix Genechip expression measures. Bioinformatics 20 323-331. 
Cui, X., Kerr, M. K. and Churchill, G. A. (2003). Transformations for cDNA mi- 

croarray data. Statistical Applications in Genetics and Molecular Biology 2 Article 4. 

MR2011189 

DuDOiT, S., Yang, Y. H., Callow, M. J. and Speed, T. P. (2002). Statistical methods 
for identifying differentially expressed genes in replicated cDNA microarray experi- 
ments. Statist. Sinica 12 111-139. MR1894191 

DuRBiN, B. P., Hardin, J. S., Hawkins, D. M. and Rocke, D. M. (2002). A 
variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 
18(Suppl. 1) S105-S110. 



24 



Z. WU AND R. A. IRIZARRY 



Geller, S. C, Gregg, J. P., Hagerman, P. and Rocke, D. M. (2003). Transformation 
and normalization of oligonucleotide microarray data. Bioinformatics 19 1817-1823. 

Getz, G., Levine, E. and Domany, E. (2000). Coupled two-way clustering analysis of 
gene microarray data. Proc. Natl. Acad. Sci. USA 97 12079-12084. 

GiAEVER, G., Chu, a. M., Ni, L., Connelly, C, Riles, L., Veronneau, S., Dow, 
S., Lucau-Danila, a., Anderson, K., Andre, B., Arkin, A. P., Astromoff, 
A., El-Bakkoury, M., Bangham, R., Benito, R., Brachat, S., Campanaro, S., 
CuRTiss, M., Davis, K., Deutschbauer, A., Entian, K. D., Flaherty, P., Foury, 
F., Garfinkel, D. J., Gerstein, M., Gotte, D., Guldener, U., Hegemann, J. H., 
Hempel, S., Herman, Z., Jaramillo, D. F., Kelly, D. E., Kelly, S. L., Kotter, 
P., LaBonte, D., Lamb, D. C, Lan, N., Liang, H., Liao, H., Liu, L., Luo, C, 
LussiER, M., Mao, R., Menard, P., Ooi, S., Revuelta, J., Roberts, C, Rose, M., 
Ross-Macdonald, p., Scherens, B., Schimmack, G., Shafer, B., Shoemaker, 
D. D., Sookhai-Mahadeo, S., Storms, R. K., Strathern, J. N., Valle, G., Voet, 

M., VOLCKAERT, G., WANG, C. Y., WARD, T. R., WiLHELMY, J., WiNZELER, E. A., 

Yang, Y., Yen, G., Youngman, E., Yu, K., Bussey, H., Boeke, J. D., Snyder, 
M., Philippsen, p., Davis, R. W. and Johnston, M. (2002). Functional profiling of 
the Saccharomyces cerevisiae genome. Nature 418 387-391. 

GOTTARDO, R., Pannucci, J. A., KuSKE, C. R. and Brettin, T. (2003). Statistical 
analysis of microarray data: A Bayesian approach. Biostatistics 4 597-620. 

Hein, A.-M., Richardson, S., Causton, H. C, Ambler, G. K. and Green, P. J. 
(2005). BGX: A fully Bayesian gene expression index for Affymetrix GeneChip data. 
Biostatistics 6 349-373. 

Hekstra, D., Taussig, A. R., Magnasco, M. and Naef, F. (2003). Absolute mRNA 
concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic 
Acids Res. 31 1962-1968. 

Hubbell, E., Liu, W.-M. and Mei, R. (2002). Robust estimators for expression analysis. 
Bioinformatics 18 1585-1592. 

HuBER, W., VON Heydebreck, A., Sultmann, H., Poustka, a. and ViNGRON, M. 
(2002). Variance stabilization applied to microarray data calibration and to the quan- 
tification of differential expression. Bioinformatics 1 1-9. 

Irizarry, R. a., B. Hobbs, F. C, Beaxer-Barclay, Y., Antonellis, K., Scherf, 
U. and Speed, T. (2003a). Exploration, normalization and summaries of high density 
oligonucleotide array probe level data. Biostatistics 4 249-264. 

Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B. and Speed, 
T. P. (2003b). Summaries of affymetrix genechip probe level data. Nucleic Acids Re- 
search 31. 

Irizarry, R. A., Wu, Z. and Jaffee, H. (2006). Comparison of affymetrix genechip 

expression measures. Bioinformatics 22 789-794. 
Kendziorski, C. M., Newton, M. A., Lan, H. and Gould, M. (2003). On parametric 

empirical Bayes methods for comparing multiple groups using replicated gene expression 

profiles. Statistics in Medicine 22 3899-3914. 
Kerr, M., Afshari, C, Bennett, L., Bushel, P., Martinez, J., Walker, N. and 

Churchill, G. (2002). Statistical analysis of a gene expression microarray experiment 

with replication. Statist. Sinica 12 203-217. MR1894195 
Kerr, M. K., Martin, M. and Churchill, G. A. (2000). Analysis of variance for gene 

expression microarray data. J. Comput. Biol. 7 819-837. 
Lee, M.-L. T., Kuo, F. C., Whitmore, G. A. and Sklar, J. (2000). Importance of 

replication in microarray gene expression studies: Statistical methods and evidence from 

repetitive cDNA hybridizations. Proc. Natl. Acad. Set. USA 97 9834-9839. 



FRAMEWORK FOR MICROARRAY PROBE-LEVEL DATA 



25 



Li, C. and Wong, W. (2001). Model-based analysis of oligonucleotide arrays: Expression 
index computation and outlier detection. Proc. Natl. Acad. Sci. USA 98 31-36. 

Liu, W., Mei, R., Di, X., Ryder, T. B., Hubbell, E., Dee, S., Webster, T. A., 
Harrington, C. A., Ho, M., Baid, ,J. and Smeekens, S. P. (2002). Analysis of 
high density expression microarrays with signed-rank call algorithms. Bioinformatics 
18 1593-1599. 

Liu, X., Milo, M., Lawrence, N. D. and Rattray, M. (2006). Probe-level measurement 
error improves accuracy in detecting differential gene expression. Bioinformatics 22 
2107-2113. 

Lonnstedt, L and Speed, T. (2002). Replicated microarray data. Statist. Sinica 12 31- 
46. MR1894187 

Meyer, C, Gottardo, R., Carroll, J., Brown, M. and Liu, X. (2006). Model-based 
analysis of tiling-arrays for chip-chip. Proc. Natl. Acad. Sci. 103 12457-12462. 

Naef, F. and Magnasco, M. O. (2003). Solving the riddle of the bright mismatches: 
Labeling and effective binding in oligonucleotide arrays. Phys. Rev. E 68 011906. 

Newton, M., Kendziorski, C, Richmond, C, Blattner, F. and Tsui, K. (2001). On 
differential variability of expression ratios: Improving statistical inference about gene 
expression changes from microarray data. J. Comput. Biol. 8 37-52. 

Pan, W., Lin, J. and Le, C. (2003). A mixture model approach to detecting differentially 
expressed genes with microarray data. Functional Integrative Genomics 3 117-124. 

Peyser, B. D., Irizarry, R. A., Tiffany, C, Chen, O., Yuan, D. S., Boeke, J. D. 
and Spencer, F. A. (2005). Improved statistical analysis of budding yeast tag microar- 
rays revealed by defined spike-in pools. Nucieic Acids Res. 33 40. 

Rattray, M., Liu, X., Sanguinetti, G., Milo, M. and Lawrence, N. D. (2006). 
Propagating uncertainty in microarray data analysis. Briefings in Bioinformatics 7 37- 
47. 

ROCKE, D. M. and Durbin, B. (2001). A model for measurement error for gene expression 
arrays. J. Comput. Biology 8 557-569. 

ScHENA, M., Shalon, D., Heller, R., Chai, a.. Brown, P. and Davis, R. (1996). 
Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. 
Proc. Natl. Acad. Sci. USA 93 10614-10619. 

Singh-Gasson, S., Green, R. D., Yue, Y., Nelson, C, Blattner, F., Sussman, 
M. R. and Cerrina, F. (1999). Maskless fabrication of light-directed oligonucleotide 
microarrays using a digital micromirror array. Nature Biotechnology 17 974-978. 

Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differ- 
ential expression in microarray experiments. Statistical Applications in Genetics and 
Molecular Biology 3 Article 3. MR2101454 

TuSHER, v., Tibshirani, R. and Chu, C. (2001). Significance analysis of microarrays 
applied to ionizing radiation response. Proc. Natl. Acad. Sci. USA 98 5116-5121. 

Wang, W., Caravalho, B., Miller, N., Pevsner, J., Chakravarti, A. and Irizarry, 
R. A. (2006a). Estimating genome- wide copy number using allele specific mixture mod- 
els. Working Papers 122, Dept. of Biostatistics, Johns Hopkins University. Available at 
http : //www.bepress . com/ jhubiost at /paper 122 . 

Wang, X., He, H., Li, L., Chen, R., Deng, X. W. and Li, S. (2006b). Nmpp: A user- 
customized nimblegen microarray data processing pipeline. Bioinformatics 22 2955- 
2957. 

Wolfinger, R., Gibson, G., Wolfinger, E., Bennett, L., Hamadeh, H., Bushel, 
P., Afshari, C. and Paules, R. (2001). Assessing gene significance from cDNA mi- 
croarray expression data via mixed models. J. Comput. Biol. 8 625-637. 



26 



Z. WU AND R. A. IRIZARRY 



Wu, Z. and Irizarry, R. (2004). Stochastic models inspired by hybridization theory for 
short ohgonucleotide arrays. In Proceedings of RECOMB 2004. J- Comput. Biol. 12 
882-893. 

Wu, Z., Irizarry, R., Gentlemen, R., Martinez-Murillo, F. and Spencer, F. 

(2004) . A model-based background adjustment for oligonucleotide expression arrays. 
J. Amer. Statist. Assoc. 99 909-917. MR2113309 

Wu, Z. and Irizarry, R. A. (2005). A statistical framework for the analysis of microarray 
probe-level data. Working papers, Dept. Biostatistics, Johns Hopkins Univ. Available 
at http : //www.bepress . com/jhubiostat/paper73. 

Yang, I. V., Chen, E., Hasseman, J. P., Liang, W., Frank, B. C, Wang, S., 
Sharov, v., Saeed, a. I., White, ,J., Li, ,J., Lee, N. H., Yeatman, T. J. and 
QuACKENBUSH, J. (2002). Within the fold: Assessing differential expression measures 
and reproducibility in microarray assays. Genome Biology 3 research0062. 1-0062. 12. 

Yuan, D., Pan, X., Ooi, S., Peyser, B., Spencer, F., Irizarry, R. and Boeke, J. 

(2005) . Improved microarray methods for profiling the yeast knockout strain collection. 
To appear. 

Department of Community Health Department of Biostatistics 

Brown University Johns Hopkins University 

121 South Main St. Center for Statistical Sciences 615 N. Wolfe St. E3620 

Providence, Rhode Island 02906 Baltimore, Maryland 21205 



USA 

E-MAIL: zhijin_wu@brown.edu 



USA 

E-MAIL: rafaOjhu.cdu 



